1 Introduction

The Scenario: For this project, we take on the role of a junior data analyst working on the marketing analytics team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, our team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, our team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve our recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Below, we carry out this task and follow the key steps of the data analysis process: ask, prepare, process, analyse, share, and act.

2 Ask

First, we clarify the marketing department’s guiding question: How do annual members and casual riders use Cyclistic bikes differently?

The key business task is therefore to uncover the key differences between annual member and casual rider bike use.

By the end we hope to be able to present three main differences between member and casual bike use.

3 Prepare

The dataset we are to use is located here and was made available by Motivate International Inc. under this license. It consists of data from the Chicago-based bicycle sharing service “Divvy”, which, for the purpose of this project, we consider to be reliable, original, comprehensive, and current data for our fictional company “Cyclistic”.

The dataset includes data from as far back as 2013. Data is packaged as zip-files in monthly increments, but data from before 2020-04 is quarterly or bi-annual. For this project, we use the monthly data ranging from 2020-04 to 2021-07.
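The monthly archives can also be fetched programmatically. The sketch below assumes the Divvy file-naming scheme (YYYYMM-divvy-tripdata.zip) and S3 download host; both are assumptions and may need adjusting to the actual source layout:

```r
library(here)

## months of interest: 2020-04 through 2021-07
months <- format(seq(as.Date("2020-04-01"), as.Date("2021-07-01"), by = "month"), "%Y%m")

for (m in months) {
  zip_name <- paste0(m, "-divvy-tripdata.zip")                         # assumed naming scheme
  url <- paste0("https://divvy-tripdata.s3.amazonaws.com/", zip_name)  # assumed host
  dest <- here("Data", zip_name)
  if (!file.exists(dest)) {
    download.file(url, dest, mode = "wb")   # download the monthly zip
    unzip(dest, exdir = here("Data"))       # extract the csv alongside it
  }
}
```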

3.1 Assess Data Integrity

To begin, we read the downloaded monthly data into RStudio, bind the months together, and examine the raw data using the skimr package.

options(width = 110)

library(tidyverse)
library(here)
library(skimr)

## list the monthly csv files and bind them into a single tibble
csv_list <- dir(here("Data"), pattern = "\\.csv$")
cycle_data <- csv_list %>%
  map_dfr(~ read_csv(here("Data", .x), show_col_types = FALSE))

cycle_data %>% skim_tee()
## ── Data Summary ────────────────────────
##                            Values 
## Name                       data   
## Number of rows             5910616
## Number of columns          13     
## _______________________           
## Column type frequency:            
##   character                7      
##   numeric                  4      
##   POSIXct                  2      
## ________________________          
## Group variables            None   
## 
## ── Variable type: character ──────────────────────────────────────────────────────────────────────────────────
##   skim_variable      n_missing complete_rate   min   max empty n_unique whitespace
## 1 ride_id                    0         1        16    16     0  5910407          0
## 2 rideable_type              0         1        11    13     0        3          0
## 3 start_station_name    369331         0.938    10    53     0      749          0
## 4 start_station_id      369956         0.937     1    36     0     1287          0
## 5 end_station_name      409155         0.931    10    53     0      746          0
## 6 end_station_id        409616         0.931     1    36     0     1286          0
## 7 member_casual              0         1         6     6     0        2          0
## 
## ── Variable type: numeric ────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate  mean     sd    p0   p25   p50   p75  p100 hist 
## 1 start_lat             0         1      41.9 0.0437  41.6  41.9  41.9  41.9  42.1 ▁▁▇▇▁
## 2 start_lng             0         1     -87.6 0.0261 -87.9 -87.7 -87.6 -87.6 -87.5 ▁▁▂▇▁
## 3 end_lat            6905         0.999  41.9 0.0438  41.5  41.9  41.9  41.9  42.2 ▁▁▇▇▁
## 4 end_lng            6905         0.999 -87.6 0.0263 -88.1 -87.7 -87.6 -87.6 -87.4 ▁▁▁▇▁
## 
## ── Variable type: POSIXct ────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate min                 max                 median              n_unique
## 1 started_at            0             1 2020-04-01 00:00:30 2021-07-31 23:59:58 2020-11-25 11:10:06  5035619
## 2 ended_at              0             1 2020-04-01 00:10:45 2021-08-12 17:45:41 2020-11-25 10:46:32  5021291

Based on this initial read-in, we can see that there are several data validity issues:

First, the number of rows (5910616) does not match the number of unique ride_ids (5910407). This indicates that there may be duplicate entries in the dataset.

Second, six features have missing values, to varying extents. Each of these features relates to the starting or ending location of the rides.

Third, station names and station ids have differing numbers of unique values. When we explore this further, we find that some stations have multiple ids, most likely because a number of stations were renumbered between 2020-11-30 and 2020-12-01.

df1 <-cycle_data %>% 
  select(start_station_name,start_station_id,started_at)%>%
  group_by(start_station_name,start_station_id)%>%
  summarise(s_start_date = min(started_at),
            s_end_date=(max(started_at)),
            .groups='drop')%>%
  as.data.frame()

df2 <- cycle_data %>% 
  select(end_station_name,end_station_id,ended_at)%>%
  group_by(end_station_name,end_station_id)%>%
  summarise(e_start_date = min(ended_at),
            e_end_date=(max(ended_at)),
            .groups='drop')%>%
  as.data.frame()

full_df <- full_join(df1, df2, by = c("start_station_name" = "end_station_name","start_station_id"="end_station_id"))
full_df

3.2 Potential of data and limitations

3.2.1 Potential

The data can be used to answer several key questions, including, but not limited to, the following:

  • How long on average does it take different user types to complete a journey?
  • How far do users travel per trip on average?
  • How does the total monthly number of bike rentals per user group change throughout the year?
  • What times of day experience the most bike rentals during week days and weekends?
  • Which docking stations are the most commonly used by both user types? (if combined with a calculation of docking-station location density, this could help identify which parts of the city to target for further expansion, potentially converting more casual users into members)
  • What types of bikes are used by each user type?

3.2.2 Limitations

Due to privacy issues, the data does not include member, price, or payment information.

This means that it is impossible to identify repeat users, analyse user behaviour based on variables such as age or gender, or assess the impact of special offers or price changes.

4 Process

For processing, we use R and RStudio as the tools of choice. R is a programming language that is particularly useful for statistical analysis and works well with large datasets. RStudio is a versatile IDE that allows the user to generate reports written in R, such as this one.

We predominantly use the tidyverse package for this stage.

4.1 Generate new variables

Before cleaning up this data, it is useful to add some new variables:

  • cy_year
  • cy_month: numeric value from 1-12
  • cy_day: numeric value from 1-31
  • cy_day_of_week: Monday - Sunday, factorised so that Monday is considered day 1
  • cy_iso_week: the ISO week number, 1-53
  • cy_time_m: the duration in minutes between the start and end of a journey
  • cy_distance_m: the distance travelled between starting and ending point

The first six new variables need no special packages beyond base R and lubridate:

##Factorising month and day of week
month_levels <- 1:12
day_of_week_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

##Creating new variables
cycle_data2 <- cycle_data %>% 
  mutate(
    cy_year = started_at %>% format("%Y") %>% as.numeric(),
    cy_month = started_at %>% format("%m") %>% as.numeric() %>% factor(levels = month_levels),
    cy_day = started_at %>% format("%d") %>% as.numeric(),
    cy_day_of_week = started_at %>% weekdays() %>% factor(levels = day_of_week_levels),
    cy_iso_week = started_at %>% isoweek() %>% as.numeric(),
    cy_time_m = difftime(ended_at, started_at, units = "mins")
  )

The variable related to distance requires the package ‘geosphere’.

According to the R documentation, geosphere’s function distHaversine calculates the ‘shortest distance between two points (i.e., the ’great-circle-distance’ or ‘as the crow flies’)…assumes a spherical earth, [and] ignor[es] ellipsoidal effects.’

We will use it here to estimate the rough distance between the start and end points per ride:

cycle_data2 <- cycle_data2 %>%
  mutate(
    cy_distance_m = distHaversine(cbind(start_lng,start_lat),cbind(end_lng,end_lat))
  )
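As a quick sanity check, the haversine formula can also be written in a few lines of base R; using geosphere's default earth radius (6378137 m), this should agree with distHaversine. The coordinates below are arbitrary points in central Chicago:

```r
## haversine distance in metres on a spherical earth (base R)
haversine_m <- function(lng1, lat1, lng2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * r * asin(sqrt(a))
}

## two points 0.01 degrees of latitude apart: roughly 1.1 km
haversine_m(-87.63, 41.88, -87.63, 41.89)
```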

With these new variables added, it is now time to clean the dataset.

4.2 Data Cleaning

4.2.1 Remove duplicates

To assess duplicates, we use the function ‘get_dupes’ from the janitor package.

The duplicated ride_id values are most likely due to the change in station ids mentioned earlier. For every instance of a duplicated ride_id, one of the station ids is the older one and one is the newer. Each of the rides with a newer station id has a starting time that occurs after its ending time:

cycle_data2%>%get_dupes(ride_id)%>%select(ride_id,started_at,ended_at,start_station_name, start_station_id, end_station_name,  end_station_id,cy_time_m,cy_distance_m)

To remove these duplicates, we filter out all rows where cy_time_m is negative:

cycle_data2 <- cycle_data2%>%filter(cy_time_m >=0)

#Checking to see if it worked
cycle_data2%>%summarise(n_unique =n_distinct(ride_id),
                        n_rows = n())
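On a toy example (made-up ride_ids, not rows from the dataset), get_dupes returns only the rows whose ride_id appears more than once, with a dupe_count column prepended:

```r
library(tibble)
library(janitor)

toy <- tibble(
  ride_id = c("A1", "A1", "B2", "C3"),
  note    = c("old station id", "new station id", "ok", "ok")
)

## both A1 rows come back, each flagged with dupe_count = 2
toy %>% get_dupes(ride_id)
```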

While on the topic of time, let’s take another look at the cy_time_m feature:

4.2.2 cy_time_m

cycle_data2%>%select(cy_time_m)%>%summarise(min_time = min(cy_time_m),
                                            av_time = mean(cy_time_m),
                                            med_time = median(cy_time_m),
                                            max_time = max(cy_time_m),
                                            sd = sd(cy_time_m))
cycle_data2 %>% ggplot() +
  geom_boxplot(aes(x=cy_time_m))+
  labs(title='Length of Cycle Journeys (Mins)',x='Journey Length (Mins)')

There’s clearly quite a bit of variation here. It’s doubtful that someone would hire a bicycle for over 20000 minutes, so what’s going on?

At a resolution of 24 hours, the spread is outlined in the table below:

cycle_data2 %>% select(cy_time_m)%>%
  mutate(class_days = case_when(cy_time_m < 0 ~ "<0",
                         cy_time_m >= 0 & cy_time_m <= 1440 ~"0-1d",
                         cy_time_m >1440 & cy_time_m <=4320 ~ "1-3d",
                         cy_time_m > 4320 ~ "3d+"))%>%
  group_by(class_days)%>%
  summarise(number_trips = n(),
            percent_of_total_trips = n()/nrow(cycle_data2)*100)

After looking through the data, it’s clear that a small number of these lengthy rides are due to bicycles being checked in at the main test facilities. These can be removed. Since Cyclistic offers the possibility of purchasing a day pass for bicycles, it is possible that a minority of riders genuinely rent bicycles for longer time periods.

Since these amount to a very small proportion of the wider dataset anyway, we leave them in here.

##remove rides involving the bike check/test stations
cycle_data2 <- cycle_data2%>%filter(
                       end_station_name != "HUBBARD ST BIKE CHECKING (LBS-WH-TEST)" & 
                       start_station_name != "WATSON TESTING - DIVVY")

4.2.3 cy_distance_m

A short look uncovers that there are many distances that are less than 100m:

cycle_data2 %>%mutate(class_distance = case_when(cy_distance_m <=0 ~ "<=0m",
                             cy_distance_m > 0 & cy_distance_m < 10 ~ "0-10m",
                             cy_distance_m >= 10 & cy_distance_m < 100 ~ "10-100m",
                             cy_distance_m >= 100 ~"100+m"))%>%
                        group_by(class_distance)%>%
                        summarise(n = n(),
                                  percent_of_total_trips = n()/nrow(cycle_data2)*100)

When checking the distance values, another interesting thing becomes apparent: close to 10% of all rides (532948 rows) are below 100m. Of these, approximately 80% travelled 0 metres. When we examine the starting and ending docking stations below, we find out why this is the case:

cycle_data2 %>%filter(cy_distance_m < 1)%>%select(-rideable_type,-start_station_id,-end_station_id,-start_lat,-end_lat,-start_lng,-end_lng,-member_casual,-cy_year,-cy_month,-cy_day,-cy_day_of_week,-cy_iso_week)%>%arrange(desc(cy_time_m))

The starting and ending docking stations are the same, exposing an unfortunate limitation of the dataset. As we only have starting and ending stations to work with, calculating the distance between them cannot estimate the distance travelled when users ride a circular loop and return the bicycle to the station they picked it up from.

This restricts the conclusions we can draw from summary distance data: leaving these rows in deflates the summary statistics relative to the distances actually travelled:

cycle_data2 %>%select(cy_distance_m)%>%rename(original_values_distance_m = cy_distance_m)%>%summary()
##  original_values_distance_m
##  Min.   :    0.0           
##  1st Qu.:  879.3           
##  Median : 1690.5           
##  Mean   : 2205.4           
##  3rd Qu.: 2992.8           
##  Max.   :48425.0
cycle_data2 %>%select(cy_distance_m)%>%filter(cy_distance_m >= 1)%>%rename(above1m_distance_m = cy_distance_m)%>%summary()
##  above1m_distance_m
##  Min.   :    1     
##  1st Qu.: 1091     
##  Median : 1869     
##  Mean   : 2424     
##  3rd Qu.: 3185     
##  Max.   :48425
cycle_data2 %>%select(cy_distance_m)%>%filter(cy_distance_m >= 100)%>%rename(above100m_distance_m = cy_distance_m)%>%summary()
##  above100m_distance_m
##  Min.   :  100       
##  1st Qu.: 1118       
##  Median : 1891       
##  Mean   : 2450       
##  3rd Qu.: 3209       
##  Max.   :48425
cycle_data2 %>%select(cy_distance_m)%>%filter(cy_distance_m >= 500)%>%rename(above500m_distance_m = cy_distance_m)%>%summary()
##  above500m_distance_m
##  Min.   :  500       
##  1st Qu.: 1222       
##  Median : 1979       
##  Mean   : 2557       
##  3rd Qu.: 3302       
##  Max.   :48425

For the purpose of this analysis, however, it is useful to leave these rows in.

4.2.4 start_station_id / end_station_id

4.2.4.1 NAs

First, let’s check how many NA values remain in these columns:

cycle_data2%>%select(start_station_id,end_station_id)%>%sapply(function(x) sum(is.na(x)))
## start_station_id   end_station_id 
##              523              404

It appears that the earlier measures reduced the number of NAs by quite a bit.

We now take a look at the rows with NA values

##initial look
cycle_data2%>%filter(is.na(start_station_id) | is.na(end_station_id))%>%arrange(started_at)
##checking whether the end stations differ from the start stations
cycle_data2%>%filter(is.na(start_station_id))%>%select(start_station_name)%>%unique()
cycle_data2%>%filter(is.na(end_station_id))%>%select(end_station_name)%>%unique()
##the station names are present, so we save them in a new df
station_NA_names<- cycle_data2%>%filter(is.na(end_station_id))%>%select(end_station_name)%>%rename(NAstation_names = end_station_name)%>%unique()

## only two stations lack ids, and other rows include these values; we use the df from the previous step to find the id number for each station
cycle_data2%>%
  filter(start_station_name %in% station_NA_names$NAstation_names & !is.na(start_station_id))%>%
  select(start_station_name,start_station_id)%>%
  distinct()

With these identified, we can now replace the NA values and continue:

##Using the Id values to replace NA values
cycle_data2<-cycle_data2%>%
  mutate(
    start_station_id = if_else(
    start_station_name =="W Oakdale Ave & N Broadway" & is.na(start_station_id),"20252.0",
    if_else(start_station_name =="W Armitage Ave & N Sheffield Ave" & is.na(start_station_id),"20254.0",start_station_id)),
    end_station_id = if_else(
    end_station_name =="W Oakdale Ave & N Broadway" & is.na(end_station_id),"20252.0",
    if_else(end_station_name =="W Armitage Ave & N Sheffield Ave" & is.na(end_station_id),"20254.0",end_station_id)
    ))

##checking to see that NAs are removed correctly
cycle_data2%>%select(start_station_id,end_station_id)%>%sapply(function(x) sum(is.na(x)))
## start_station_id   end_station_id 
##                0                0

4.2.4.2 Replacing the duplicate ID Numbers

Here, the end goal is to be able to assess the popularity of each docking station, as defined by the number of times it has been used as a starting or ending station.

To do this we filter the distinct ids for starting and ending station ids before 2020-11-30, the transition date, from those after. We join these together and merge the id columns so that only the currently used ids are used. Following this, we use left_joins to add the updated id values to the dataset for both start_station_id and end_station_ids, replacing the original columns.

##creating list of starting stations BEFORE transition date
station_list_a_start<-cycle_data2 %>% filter(
  started_at < date("2020-12-01")
  )%>%
  distinct(
    start_station_name,
    start_station_id
  )%>%
  rename(
    station_name_a = start_station_name,
    station_id_a = start_station_id
  )%>%
  arrange(station_name_a)

##creating list of ending stations BEFORE transition date
station_list_a_end<-cycle_data2 %>% filter(
  ended_at < date("2020-12-01")
  )%>%
  distinct(
    end_station_name,
    end_station_id
  )%>%
  rename(
    station_name_a = end_station_name,
    station_id_a = end_station_id
  )%>%
  arrange(station_name_a)

##joining both lists together to create comprehensive list
station_list_a_full <- full_join(station_list_a_start, station_list_a_end)

##creating list of starting stations AFTER transition date

station_list_b_start<-cycle_data2 %>% filter(
  started_at >= date("2020-12-01")
  )%>%
  distinct(
    start_station_name,
    start_station_id
  )%>%
  rename(
    station_name_b = start_station_name,
    station_id_b = start_station_id
  )%>%
  arrange(station_name_b)

##creating list of ending stations AFTER transition date
station_list_b_end<-cycle_data2 %>% filter(
  ended_at >= date("2020-12-01")
  )%>%
  distinct(
    end_station_name,
    end_station_id
  )%>%
  rename(
    station_name_b = end_station_name,
    station_id_b = end_station_id
  )%>%
  arrange(station_name_b)

##joining both lists together to create comprehensive list
station_list_b_full <- full_join(station_list_b_start, station_list_b_end)

station_lookup<-station_list_a_full%>%
  full_join(
    .,station_list_b_full, by=c("station_name_a" = "station_name_b")
    )%>%
  mutate(
    station_id_b = if_else(is.na(station_id_b),station_id_a,station_id_b),
  )%>%
  select(-station_id_a)%>%
  rename(
    station_name = station_name_a,
    station_id = station_id_b
  )%>%
  distinct(
    station_name, .keep_all = TRUE  )%>%
  arrange(
    station_name
  )
cycle_data3 <- cycle_data2%>%
  select(-start_station_id, -end_station_id)%>%
  left_join(.,station_lookup, by = c("start_station_name" = "station_name"))%>%
  rename(start_station_id = station_id)%>%
  left_join(.,station_lookup, by =c("end_station_name" = "station_name"))%>%
  rename(end_station_id = station_id)
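The effect of this lookup join can be illustrated on a toy example. The station names and ids below are hypothetical, chosen only to mirror the old-id/new-id pattern:

```r
library(dplyr)
library(tibble)

## hypothetical lookup of current (post-transition) ids
lookup <- tibble(
  station_name = c("Clark St & Lake St", "State St & Harrison St"),
  station_id   = c("KA123", "SL456")
)

## hypothetical rides carrying the old ids
rides <- tibble(
  ride_id            = c("r1", "r2"),
  start_station_name = c("Clark St & Lake St", "State St & Harrison St"),
  start_station_id   = c("38", "62")
)

## drop the inconsistent id column and re-attach the current id per station name
rides %>%
  select(-start_station_id) %>%
  left_join(lookup, by = c("start_station_name" = "station_name")) %>%
  rename(start_station_id = station_id)
```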

5 Analysis

5.1 Median time and distance spent on bicycles

  • Both casual users and members typically travel ~ 1.75 km per trip; however, casual users ride ~ 10 minutes longer than members.
5.2 Rideable proportions by user type

  • Both user groups use roughly the same proportion of bicycle types.

  • Members use slightly more classic bikes than casual users, who favour docked bikes.

6 Key Conclusions

6.1 Members

Bicycle use is driven by members. In 2020, particularly over winter, and up until June 2021, members have driven usage of Cyclistic bicycles. They favour docked bicycles overall, but are more likely than casual users to choose classic bikes. They travel the same distances as casual users, but use bicycles ~ 10 minutes less per trip, suggesting that they are more accustomed to bicycle travel.

Member time-of-day patterns indicate that they use Cyclistic bicycles for travel to and from work.

6.2 Casual users

Casual use primarily takes place in summer, with 2021 usage stronger than in 2020. Casual use in April 2021 rose by 475% compared to the previous year, and increases have generally been stronger than for members, indicating higher brand visibility compared to 2020. Close to half of all casual users choose docked bicycles. Casual users are less likely to use bicycles for travel to work, but, like members, are much more likely to travel in the evening, particularly after 9pm.

6.3 Recommendations

1. Market discounted membership packages during summer months.
2. Target marketing to those who take public transport to work.
3. Increase the number of docking stations to drive use of docked bicycles.